SYS 6018 | Spring 2022 | University of Virginia


1 Introduction

During the 2010 Earthquake in Haiti, many people lost their homes. These homeless people generally live under blue TARP. Our task was to train the model to quickly locate these homeless people in aerial images, in other words to locate blue Tarp.When we know where the homeless are, we can deploy our resources more effectively. More accurate estimates of the number of people affected. Aerial images are RGB three-channel images, and each sample is a pixel point. In the data, the classification of each sample has been marked, so I use supervised learning methods.

2 Training Data / EDA

Load data, explore data, etc.

# Load Required Packages
library(tidyverse)
library(glmnet)
library(yardstick)
library(broom)
library(plotly)
library(nnet)
library(e1071)  # functions for svm
library(pROC) # functions for AUROC
library(naivebayes) # package for naivebayes
library(caret)
library(MASS) # package for lda and qda
library(purrr) # package for thresholds searching
library(randomForest) # package for randomforest
library(gbm) # package for gbm
#-- Load Data and add binary outcome
data = read_csv('HaitiTraining.csv') %>%
mutate(bluetarp = ifelse(Class == 'Blue Tarp', 1L, 0L) %>% factor())

#-- EDA & Explore data
head(data,n=50)
#> # A tibble: 50 × 5
#>    Class        Red Green  Blue bluetarp
#>    <chr>      <dbl> <dbl> <dbl> <fct>   
#>  1 Vegetation    64    67    50 0       
#>  2 Vegetation    64    67    50 0       
#>  3 Vegetation    64    66    49 0       
#>  4 Vegetation    75    82    53 0       
#>  5 Vegetation    74    82    54 0       
#>  6 Vegetation    72    76    52 0       
#>  7 Vegetation    71    72    51 0       
#>  8 Vegetation    69    70    49 0       
#>  9 Vegetation    68    70    49 0       
#> 10 Vegetation    67    70    50 0       
#> # … with 40 more rows
ggplot(data) + 
  geom_bar(mapping = aes(x = Class))

plot_ly(x=data$Red, y=data$Green, z=data$Blue, type="scatter3d", mode="markers", color=data$Class) %>% layout(autosize = F, width = 500, height = 500)
#-- The ratio of bluetarp and non-bluetarp
ggplot(data) + 
  geom_bar(mapping = aes(x = bluetarp))

plot_ly(x=data$Red, y=data$Green, z=data$Blue, type="scatter3d", mode="markers", color=data$bluetarp) %>% layout(autosize = F, width = 500, height = 500)
#-- Make function to calculate confusion matrix metrics
get_conf <- function(thres, y, p_hat) {
yhat = ifelse(p_hat >= thres, 1L, 0L)
tibble(threshold = thres,
TP = sum(y == 1 & yhat == 1),
FP = sum(y == 0 & yhat == 1),
FN = sum(y == 1 & yhat == 0),
TN = sum(y == 0 & yhat == 0)
)
}

3 Model Training

3.1 Set up cross-validation folds

set.seed(2021)
nfolds = 10 # number of folds
n= nrow(data) # number of observations in training dataset
folds = sample(rep(1:nfolds, length = n))
train_control <- trainControl(
  method = "cv", 
  number = nfolds
  )

3.2 Naive Bayes

3.2.1 Tuning Parameter usekernel,adjust,fL

There are three hyperparameters in naive bayes model usekernel: parameter allows us to use a kernel density estimate for continuous variables versus a guassian density estimate, adjust: allows us to adjust the bandwidth of the kernel density (larger numbers mean more flexible density estimate), fL: allows us to incorporate the Laplace smoother.

# set up tuning grid
search_grid <- expand.grid(
usekernel = c(TRUE, FALSE),
fL = 0:5,
adjust = seq(0, 5, by = 1)
)

# train model
model <- train(
x = data[c("Red","Green","Blue")],
y = data$bluetarp,
method = "nb",
trControl = train_control,
tuneGrid = search_grid
)
# top 5 modesl
model$results %>%
top_n(50, wt = Accuracy) %>%
arrange(desc(Accuracy))
#>    usekernel fL adjust  Accuracy     Kappa   AccuracySD    KappaSD
#> 1       TRUE  0      2 0.9818156 0.5968692 0.0009338387 0.03017821
#> 2       TRUE  1      2 0.9818156 0.5968692 0.0009338387 0.03017821
#> 3       TRUE  2      2 0.9818156 0.5968692 0.0009338387 0.03017821
#> 4       TRUE  3      2 0.9818156 0.5968692 0.0009338387 0.03017821
#> 5       TRUE  4      2 0.9818156 0.5968692 0.0009338387 0.03017821
#> 6       TRUE  5      2 0.9818156 0.5968692 0.0009338387 0.03017821
#> 7       TRUE  0      3 0.9797916 0.5292260 0.0009882855 0.03304920
#> 8       TRUE  1      3 0.9797916 0.5292260 0.0009882855 0.03304920
#> 9       TRUE  2      3 0.9797916 0.5292260 0.0009882855 0.03304920
#> 10      TRUE  3      3 0.9797916 0.5292260 0.0009882855 0.03304920
#> 11      TRUE  4      3 0.9797916 0.5292260 0.0009882855 0.03304920
#> 12      TRUE  5      3 0.9797916 0.5292260 0.0009882855 0.03304920
#> 13      TRUE  0      1 0.9770561 0.5704484 0.0018426201 0.03219958
#> 14      TRUE  1      1 0.9770561 0.5704484 0.0018426201 0.03219958
#> 15      TRUE  2      1 0.9770561 0.5704484 0.0018426201 0.03219958
#> 16      TRUE  3      1 0.9770561 0.5704484 0.0018426201 0.03219958
#> 17      TRUE  4      1 0.9770561 0.5704484 0.0018426201 0.03219958
#> 18      TRUE  5      1 0.9770561 0.5704484 0.0018426201 0.03219958
#> 19      TRUE  0      4 0.9747157 0.3372781 0.0012105118 0.05122912
#> 20      TRUE  1      4 0.9747157 0.3372781 0.0012105118 0.05122912
#> 21      TRUE  2      4 0.9747157 0.3372781 0.0012105118 0.05122912
#> 22      TRUE  3      4 0.9747157 0.3372781 0.0012105118 0.05122912
#> 23      TRUE  4      4 0.9747157 0.3372781 0.0012105118 0.05122912
#> 24      TRUE  5      4 0.9747157 0.3372781 0.0012105118 0.05122912
#> 25     FALSE  0      0 0.9707152 0.1502549 0.0006506225 0.03567168
#> 26     FALSE  0      1 0.9707152 0.1502549 0.0006506225 0.03567168
#> 27     FALSE  0      2 0.9707152 0.1502549 0.0006506225 0.03567168
#> 28     FALSE  0      3 0.9707152 0.1502549 0.0006506225 0.03567168
#> 29     FALSE  0      4 0.9707152 0.1502549 0.0006506225 0.03567168
#> 30     FALSE  0      5 0.9707152 0.1502549 0.0006506225 0.03567168
#> 31     FALSE  1      0 0.9707152 0.1502549 0.0006506225 0.03567168
#> 32     FALSE  1      1 0.9707152 0.1502549 0.0006506225 0.03567168
#> 33     FALSE  1      2 0.9707152 0.1502549 0.0006506225 0.03567168
#> 34     FALSE  1      3 0.9707152 0.1502549 0.0006506225 0.03567168
#> 35     FALSE  1      4 0.9707152 0.1502549 0.0006506225 0.03567168
#> 36     FALSE  1      5 0.9707152 0.1502549 0.0006506225 0.03567168
#> 37     FALSE  2      0 0.9707152 0.1502549 0.0006506225 0.03567168
#> 38     FALSE  2      1 0.9707152 0.1502549 0.0006506225 0.03567168
#> 39     FALSE  2      2 0.9707152 0.1502549 0.0006506225 0.03567168
#> 40     FALSE  2      3 0.9707152 0.1502549 0.0006506225 0.03567168
#> 41     FALSE  2      4 0.9707152 0.1502549 0.0006506225 0.03567168
#> 42     FALSE  2      5 0.9707152 0.1502549 0.0006506225 0.03567168
#> 43     FALSE  3      0 0.9707152 0.1502549 0.0006506225 0.03567168
#> 44     FALSE  3      1 0.9707152 0.1502549 0.0006506225 0.03567168
#> 45     FALSE  3      2 0.9707152 0.1502549 0.0006506225 0.03567168
#> 46     FALSE  3      3 0.9707152 0.1502549 0.0006506225 0.03567168
#> 47     FALSE  3      4 0.9707152 0.1502549 0.0006506225 0.03567168
#> 48     FALSE  3      5 0.9707152 0.1502549 0.0006506225 0.03567168
#> 49     FALSE  4      0 0.9707152 0.1502549 0.0006506225 0.03567168
#> 50     FALSE  4      1 0.9707152 0.1502549 0.0006506225 0.03567168
#> 51     FALSE  4      2 0.9707152 0.1502549 0.0006506225 0.03567168
#> 52     FALSE  4      3 0.9707152 0.1502549 0.0006506225 0.03567168
#> 53     FALSE  4      4 0.9707152 0.1502549 0.0006506225 0.03567168
#> 54     FALSE  4      5 0.9707152 0.1502549 0.0006506225 0.03567168
#> 55     FALSE  5      0 0.9707152 0.1502549 0.0006506225 0.03567168
#> 56     FALSE  5      1 0.9707152 0.1502549 0.0006506225 0.03567168
#> 57     FALSE  5      2 0.9707152 0.1502549 0.0006506225 0.03567168
#> 58     FALSE  5      3 0.9707152 0.1502549 0.0006506225 0.03567168
#> 59     FALSE  5      4 0.9707152 0.1502549 0.0006506225 0.03567168
#> 60     FALSE  5      5 0.9707152 0.1502549 0.0006506225 0.03567168
# plot search grid results
plot(model)

Naive_Bayes_Model <- model

3.2.2 Best Model of Naive Bayes

NB_Prediction = predict(Naive_Bayes_Model,newdata= data[c("Red","Green","Blue")])
nb.cm = table(NB_Prediction, data$bluetarp)
nb.cm
#>              
#> NB_Prediction     0     1
#>             0 61200  1139
#>             1    19   883
FP = nb.cm[1,2]
FN = nb.cm[2,1]
TP = nb.cm[2,2]
TN = nb.cm[1,1]
#get the accuracy precision FPR and FPR
nb.acc = (TP + TN)/(FP + FN + TP + TN)
nb.pre =  TP/(TP + FP)
nb.tpr = TP/(TP+FN)
nb.fpr = FP/(FP + TN)
nb.auc = auc(as.numeric(data$bluetarp), as.numeric(NB_Prediction))
nb.pre = NB_Prediction

3.3 LDA

LDA_Model <- lda(bluetarp ~ Red + Green + Blue, data = data)
LDA_Prediction = predict(LDA_Model,newdata= data[c("Red","Green","Blue")])
plot(LDA_Model)

LDA_yhat <- LDA_Prediction$class
lda.cm = table(LDA_yhat, data$bluetarp)
print(lda.cm)
#>         
#> LDA_yhat     0     1
#>        0 60607   402
#>        1   612  1620
FP = lda.cm[1,2]
FN = lda.cm[2,1]
TP = lda.cm[2,2]
TN = lda.cm[1,1]
#get the accuracy precision FPR and FPR
lda.acc = (TP + TN)/(FP + FN + TP + TN)
lda.pre =  TP/(TP + FP)
lda.tpr = TP/(TP+FN)
lda.fpr = FP/(FP + TN)
lda.auc = auc(as.numeric(data$bluetarp), as.numeric(LDA_Prediction$class))

3.4 QDA

QDA_Model <- qda(bluetarp ~ Red + Green + Blue, data = data)
QDA_Prediction = predict(QDA_Model,newdata= data[c("Red","Green","Blue")])
QDA_yhat <- QDA_Prediction$class
qda.cm = table(QDA_yhat, data$bluetarp)
print(qda.cm)
#>         
#> QDA_yhat     0     1
#>        0 61201   323
#>        1    18  1699
FP = qda.cm[1,2]
FN = qda.cm[2,1]
TP = qda.cm[2,2]
TN = qda.cm[1,1]
#get the accuracy precision FPR and FPR
qda.acc = (TP + TN)/(FP + FN + TP + TN)
qda.pre =  TP/(TP + FP)
qda.tpr = TP/(TP+FN)
qda.fpr = FP/(FP + TN)
qda.auc = auc(as.numeric(data$bluetarp), as.numeric(LDA_Prediction$class))

3.5 Logistic Regression

3.6 KNN

First I’m going to train the KNN model. KNN(k nearest neighbor) is one of the simplest machine learning method. For each data point that needs to be predicted, KNN will look for k sample points closest to it and obtain the classification of prediction points by counting the class of K neighbor sample points. The distance between different sample is depand on their euclidean distance. K is the tuning parameter in this model.It determines the degree of freedom of the model, and as K increases the degree of freedom of the model decreases. ### Tuning Parameter \(k\)

trControl <- trainControl(method  = "cv",
                          number  = 5)
KNN_Model <- train(bluetarp ~ .,
             method     = "knn",
             tuneGrid   = expand.grid(k = 1:20),
             trControl  = trControl,
             metric     = "Accuracy",
             data       =  data[c("Red","Green","Blue","bluetarp")])
print(KNN_Model)
#> k-Nearest Neighbors 
#> 
#> 63241 samples
#>     3 predictor
#>     2 classes: '0', '1' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 50593, 50593, 50592, 50592, 50594 
#> Resampling results across tuning parameters:
#> 
#>   k   Accuracy   Kappa    
#>    1  0.9967426  0.9470160
#>    2  0.9969482  0.9506523
#>    3  0.9971063  0.9532870
#>    4  0.9970430  0.9523247
#>    5  0.9972486  0.9556651
#>    6  0.9971221  0.9536726
#>    7  0.9971221  0.9537533
#>    8  0.9969798  0.9514218
#>    9  0.9970114  0.9519205
#>   10  0.9970589  0.9526353
#>   11  0.9971221  0.9535729
#>   12  0.9971063  0.9533450
#>   13  0.9971537  0.9541309
#>   14  0.9968849  0.9498317
#>   15  0.9969640  0.9509849
#>   16  0.9968849  0.9497238
#>   17  0.9969640  0.9510128
#>   18  0.9969798  0.9512271
#>   19  0.9969324  0.9505657
#>   20  0.9969324  0.9505566
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 5.
plot(KNN_Model)

k = 5 is the best parameters for this knn model.

3.6.1 Train the Best KNN Model

KNN_Model <- FNN::knn(train = scale(data[c("Red","Green","Blue")]), 
                test  = scale(data[c("Red","Green","Blue")]),
                cl    = data$bluetarp,
                k     = 5,
                prob  = TRUE)
 
 ##create confusion matrix
knn.cm <- table(KNN_Model, data$bluetarp)
print(knn.cm)
#>          
#> KNN_Model     0     1
#>         0 61150    60
#>         1    69  1962
FP = knn.cm[1,2]
FN = knn.cm[2,1]
TP = knn.cm[2,2]
TN = knn.cm[1,1]
#get the accuracy precision FPR and FPR
knn.acc = (TP + TN)/(FP + FN + TP + TN)
knn.pre =  TP/(TP + FP)
knn.tpr = TP/(TP+FN)
knn.fpr = FP/(FP + TN)
knn.auc = auc(as.numeric(data$bluetarp), as.numeric(KNN_Model))

3.6.2 Tuning Parameters

3.7 Support Vector Machines (SVM)

3.7.1 Tuning Parameters

#- Set tuning parameter grid search
grid = expand_grid(
kernel = "radial",
gamma = seq(1/8, 2, length=4),
cost = 10^seq(-2,5,length=4)
)
#- Iterate over folds
perf = tibble() # to save performance metric
for(i in 1:3){ # NOTE: using 3 folds
#-- Set training/test data for fold i
test = which(folds == i) # indices of holdout/validation data
train = which(folds != i) # indices of fitting/training data
n.val = length(test) # number of observations in validation
#-- loop over tuning grid
for(j in 1:nrow(grid)){
tpars = grid[j,]
#: fit model (i.e., estimate model parameters)
fit = e1071::svm(bluetarp ~ Red + Green + Blue,
data = data[train,],
type = "C-classification",
probability=TRUE, # enable probability output
#: tuning parameters
scale = TRUE,
kernel = tpars$kernel,
gamma = tpars$gamma,
cost = tpars$cost)
#: estimate probability in hold out set
p_hat = predict(fit, data[test,], probability = TRUE) %>%
attr("probabilities") %>% .[,"1"]
#: evaluate performance
log_loss = yardstick::mn_log_loss_vec(data$bluetarp[test], p_hat,
event_level = "second")
AUROC = yardstick::roc_auc_vec(data$bluetarp[test], p_hat,
event_level = "second")
eval = tibble(log_loss, AUROC) %>%
mutate(fold = i, tuning = j,
n.val,
kernel=tpars$kernel,
gamma=tpars$gamma,
cost=tpars$cost)
#: update results
perf = bind_rows(perf, eval)
}
}
perf
#> # A tibble: 48 × 8
#>    log_loss AUROC  fold tuning n.val kernel gamma      cost
#>       <dbl> <dbl> <int>  <int> <int> <chr>  <dbl>     <dbl>
#>  1  0.0229  0.997     1      1  6325 radial 0.125      0.01
#>  2  0.0123  1.00      1      2  6325 radial 0.125      2.15
#>  3  0.00888 1.00      1      3  6325 radial 0.125    464.  
#>  4  0.00733 1.00      1      4  6325 radial 0.125 100000   
#>  5  0.0213  0.999     1      5  6325 radial 0.75       0.01
#>  6  0.0103  1.00      1      6  6325 radial 0.75       2.15
#>  7  0.00751 1.00      1      7  6325 radial 0.75     464.  
#>  8  0.00847 1.00      1      8  6325 radial 0.75  100000   
#>  9  0.0220  0.999     1      9  6325 radial 1.38       0.01
#> 10  0.0100  1.00      1     10  6325 radial 1.38       2.15
#> # … with 38 more rows
(avg_perf = perf %>%
group_by(tuning, kernel, gamma, cost) %>%
summarize(
avg_log_loss = mean(log_loss),
avg_AUROC = mean(AUROC),
sd_log_loss = sd(log_loss),
sd_AUROC = sd(AUROC)
) %>%
arrange(avg_log_loss, -avg_AUROC))
#> # A tibble: 16 × 8
#> # Groups:   tuning, kernel, gamma [16]
#>    tuning kernel gamma      cost avg_log_loss avg_AUROC sd_log_loss  sd_AUROC
#>     <int> <chr>  <dbl>     <dbl>        <dbl>     <dbl>       <dbl>     <dbl>
#>  1     15 radial 2        464.        0.00652     1.00     0.000831 0.0000583
#>  2      7 radial 0.75     464.        0.00680     1.00     0.000977 0.0000532
#>  3     11 radial 1.38     464.        0.00683     1.00     0.000863 0.0000710
#>  4      4 radial 0.125 100000         0.00694     1.00     0.000744 0.0000423
#>  5      3 radial 0.125    464.        0.00777     1.00     0.00122  0.0000801
#>  6      8 radial 0.75  100000         0.00813     1.00     0.000328 0.0000996
#>  7     12 radial 1.38  100000         0.00847     1.00     0.00108  0.0000729
#>  8     10 radial 1.38       2.15      0.00877     1.00     0.00151  0.0000874
#>  9     14 radial 2          2.15      0.00895     1.00     0.00161  0.0000923
#> 10      6 radial 0.75       2.15      0.00898     1.00     0.00160  0.0000979
#> 11     16 radial 2     100000         0.00984     1.00     0.000549 0.000192 
#> 12      2 radial 0.125      2.15      0.0103      1.00     0.00239  0.000115 
#> 13      5 radial 0.75       0.01      0.0179      0.999    0.00467  0.000264 
#> 14     13 radial 2          0.01      0.0180      0.999    0.00557  0.000287 
#> 15      9 radial 1.38       0.01      0.0183      0.999    0.00541  0.000304 
#> 16      1 radial 0.125      0.01      0.0206      0.997    0.00345  0.000517
ggplot(avg_perf, aes(gamma, cost, fill=avg_log_loss)) +
geom_tile() + scale_y_log10() +
scale_fill_distiller()

tune_svm = list(cost = 400, gamma = 2, kernel = "radial")

3.7.2 SVM final Model

pred = numeric(nrow(data))
for(v in unique(folds)) {
# set fit/eval split
ind_fit = which(folds != v)
ind_eval = which(folds == v)
# fit SVM (for specific tuning parameters)
fit= e1071::svm(bluetarp ~ Red + Green + Blue,
data = data[ind_fit,],
type = "C-classification",
probability=TRUE, # enable probability output
#: tuning parameters
scale = TRUE,
kernel = tune_svm$kernel,
gamma = tune_svm$gamma,
cost = tune_svm$cost)
# estimate probability in hold out set
pred[ind_eval] = predict(fit, data[ind_eval,], probability = TRUE) %>%
attr("probabilities") %>% .[,"1"]
}
eval_data = tibble(
y = data$bluetarp,
pred = pred,
yhat = ifelse(pred > .50, 1, 0) # hard classification
)
yhat = ifelse(pred > .50, 1, 0) # hard classification
SVM.predict = yhat
# ROC curves
ROC_data = yardstick::roc_curve(eval_data,
truth = y,
estimate = pred,
event_level = "second")
autoplot(ROC_data)

svm.cm <- table(SVM.predict, data$bluetarp)
print(svm.cm)
#>            
#> SVM.predict     0     1
#>           0 61145    97
#>           1    74  1925
FP = svm.cm[1,2]
FN = svm.cm[2,1]
TP = svm.cm[2,2]
TN = svm.cm[1,1]
#get the accuracy precision FPR and FPR
svm.acc = (TP + TN)/(FP + FN + TP + TN)
svm.pre =  TP/(TP + FP)
svm.tpr = TP/(TP+FN)
svm.fpr = FP/(FP + TN)
svm.auc = auc(as.numeric(data$bluetarp), as.numeric(RF_Prediction))
#> Error in roc.default(response, predictor, auc = TRUE, ...): 找不到对象'RF_Prediction'

3.8 Random Forest

3.8.1 Tuning Parameters Max_Tree, mtry

#-- Settings
ntree_max = 200            # maximum number of trees in forest
mtry_seq = seq(1, 3 , by=1)  # grid of tuning parameters
#-- cv
ACC = tibble()               # initiate results df
#-- Run cross-validation
sumTest <- c()
sumYhat <- c()
  for(i in 1:length(mtry_seq)) {
    acc <- c()
     rf = randomForest(bluetarp ~ ., data= data[c("Red","Green","Blue","bluetarp")]
                       , ntree = ntree_max, mtry = mtry_seq[i], cv.folds = 10) 
    out = tibble(mtry = mtry_seq[i], 
           ntree = 1:ntree_max, 
           acc = rf$err.rate)
    ACC = bind_rows(ACC, out)
  }
#-- Aggregate Results
ACC_agg = ACC %>% 
  group_by(mtry, ntree) %>% summarize(acc = 1 - mean(acc)) %>% ungroup()
ACC_agg %>% 
  mutate(mtry = factor(mtry)) %>%   # make mtry factor for plotting
  ggplot(aes(ntree, acc, color=mtry)) + geom_line() + 
  coord_cartesian(ylim=c(0.95, 0.999))

RF_Tuning <- subset(ACC_agg, acc == max(acc))
print(RF_Tuning)
#> # A tibble: 1 × 3
#>    mtry ntree   acc
#>   <dbl> <int> <dbl>
#> 1     2    57 0.982

3.8.2 Train the Final RF Model with the Opt Parameter

optNtree = RF_Tuning$ntree
optMtry = RF_Tuning$mtry
RF_Model = randomForest(bluetarp ~ ., data= data[c("Red","Green","Blue","bluetarp")]
                       , ntree = optNtree, mtry = optMtry, cv.folds = 10) 
RF_Prediction = predict(RF_Model,data = data[c("Red","Green","Blue")])
rf.cm <- table(RF_Prediction, data$bluetarp)
print(rf.cm)
#>              
#> RF_Prediction     0     1
#>             0 61138   104
#>             1    81  1918
FP = rf.cm[1,2]
FN = rf.cm[2,1]
TP = rf.cm[2,2]
TN = rf.cm[1,1]
#get the accuracy precision FPR and FPR
rf.acc = (TP + TN)/(FP + FN + TP + TN)
rf.pre =  TP/(TP + FP)
rf.tpr = TP/(TP+FN)
rf.fpr = FP/(FP + TN)
rf.auc = auc(as.numeric(data$bluetarp), as.numeric(RF_Prediction))

3.9 Boost Tree

3.9.1 Tuning Paremeter

#-- Settings
ntree_max = 200           # maximum number of trees in forest
depth.seq = seq(1, 10 , by=1)  # grid of tuning parameters
ACC.Boost = tibble()               # initiate results df
phat<-c()
K = 10
#-- k fold cv
for(k in 1:K) {
  test = (folds == k)
  train = (folds != k)
  for(i in 1:length(depth.seq)) {
    model.boost = gbm(as.character(bluetarp)~., data=data[train,c("Red","Green","Blue","bluetarp")], distribution ="bernoulli", n.trees=ntree_max, interaction.depth = depth.seq[i])
    mse <- c()
    for(j in 1:ntree_max){
      phat <- c()
      phat <- predict(model.boost, newdata =  data[test,c("Red","Green","Blue")], n.trees = j,type="response")
      #yhat = ifelse(phat >= 0.2, 1L, 0L)
      tmse = sum((as.numeric(phat) - as.numeric(data[test,]$bluetarp))^2)/length(phat)
      #get the mse of each ntree and fold
      mse = c(mse,tmse)
    }
    t <- model.boost$train.error
    out = tibble(depth = depth.seq[i], 
           ntree = 1:ntree_max, 
           mse = mse, 
           iter = k)
            ACC.Boost = bind_rows(ACC.Boost, out)
  }
}

#-- Aggregate Results
ACC_agg.boost = ACC.Boost %>% 
  group_by(depth, ntree) %>% summarize(mse = mean(mse)) %>% ungroup()

#-- Plot Results
ACC_agg.boost %>% 
  mutate(depth = factor(depth)) %>%   # make depth factor for plotting
  ggplot(aes(ntree, mse, color=depth)) + geom_line()

BT_Tuning <- subset(ACC_agg.boost, mse == min(mse))
print(BT_Tuning)
#> # A tibble: 1 × 3
#>   depth ntree   mse
#>   <dbl> <int> <dbl>
#> 1    10     5 0.998
optDepth = BT_Tuning$depth
optNtree = BT_Tuning$ntree

3.9.2 Train the Final Model with Optimal parameters

Boost_Model =  gbm(as.character(bluetarp)~., data=data[c("Red","Green","Blue","bluetarp")], distribution ="bernoulli", n.trees=optNtree, interaction.depth = optDepth)
Boost.predict = predict(Boost_Model,data, type="response")

3.9.3 Threshold Selection

#-- Calculate for range of thresholds
thres = seq(0.01, 1, length=20) # set of thresholds
perf_thres = map_df(thres, ~get_conf(., y=data$bluetarp, p_hat = Boost.predict)) %>%
mutate(cost = FN*1 + FP*1)
perf_thres
#> # A tibble: 20 × 6
#>    threshold    TP    FP    FN    TN  cost
#>        <dbl> <int> <int> <int> <int> <dbl>
#>  1    0.01    2022 61219     0     0 61219
#>  2    0.0621  1942   260    80 60959   340
#>  3    0.114   1897   232   125 60987   357
#>  4    0.166   1872   227   150 60992   377
#>  5    0.218   1861   164   161 61055   325
#>  6    0.271   1844   134   178 61085   312
#>  7    0.323   1788    87   234 61132   321
#>  8    0.375   1756    84   266 61135   350
#>  9    0.427   1619    52   403 61167   455
#> 10    0.479   1478    24   544 61195   568
#> 11    0.531   1362    11   660 61208   671
#> 12    0.583   1305     6   717 61213   723
#> 13    0.635    111     0  1911 61219  1911
#> 14    0.687     28     0  1994 61219  1994
#> 15    0.739      7     0  2015 61219  2015
#> 16    0.792      0     0  2022 61219  2022
#> 17    0.844      0     0  2022 61219  2022
#> 18    0.896      0     0  2022 61219  2022
#> 19    0.948      0     0  2022 61219  2022
#> 20    1          0     0  2022 61219  2022
#-- Make cost curve plot
perf_thres %>%
ggplot(aes(threshold, cost)) +
geom_line() +
geom_point() + geom_point(data = . %>% slice_min(cost), color="red", size=4) +
geom_vline(xintercept = 1/6, color="red") +
scale_x_continuous(breaks = seq(0, 1, by=.02))

Boost_opt_threshold = 0.1

3.9.4 Evaluate Boost Tree

yhat = ifelse(Boost.predict >= Boost_opt_threshold, 1L, 0L)
bt.cm <- table(yhat, data$bluetarp)
print(bt.cm)
#>     
#> yhat     0     1
#>    0 60987   122
#>    1   232  1900
FP = bt.cm[1,2]
FN = bt.cm[2,1]
TP = bt.cm[2,2]
TN = bt.cm[1,1]
#get the accuracy precision FPR and FPR
bt.acc = (TP + TN)/(FP + FN + TP + TN)
bt.pre =  TP/(TP + FP)
bt.tpr = TP/(TP+FN)
bt.fpr = FP/(FP + TN)
bt.auc = auc(as.numeric(data$bluetarp), as.numeric(yhat))

3.10 Penalized Logistic Regression (ElasticNet)

In this part I refer to the answers in the homework.The elasticnet models have two tuning parameters, α and λ. Once α is selected, then the glmnet() function can efficiently fit the models for a set of λ values. ### Tuning Parameters

#-- Initialize
alpha_seq = c(0, .25, .5, .75, 1) # set of alpha values to search
perf_alpha = tibble()
#-- Search over alpha (and lambda)
for(i in seq_along(alpha_seq)){
a = alpha_seq[i]
fit = cv.glmnet(x = as.matrix(data[c("Red","Green","Blue")]), y = data$bluetarp, family="binomial", alpha = a,
foldid = folds, # use same folds for all alpha
type.measure = "deviance")
perf = broom::tidy(fit) %>% slice_min(estimate) %>% mutate(alpha = a)
perf_alpha = bind_rows(perf_alpha, perf)
}
perf_alpha %>%
ggplot(aes(alpha, estimate)) + geom_point() +
geom_errorbar(aes(ymin=conf.low, ymax=conf.high), width=.05)

perf_alpha %>% arrange(estimate) %>%
mutate_at(vars(estimate:conf.high), round, digits=3)
#> # A tibble: 5 × 7
#>       lambda estimate std.error conf.low conf.high nzero alpha
#>        <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <int> <dbl>
#> 1 0.00000550    0.028     0.001    0.027     0.029     3  1   
#> 2 0.00000555    0.028     0.001    0.027     0.029     3  0.75
#> 3 0.00000832    0.029     0.001    0.028     0.03      3  0.5 
#> 4 0.0000166     0.032     0.001    0.031     0.033     3  0.25
#> 5 0.00416       0.132     0.002    0.13      0.134     3  0
tune_enet = perf_alpha %>%
slice_min(estimate)
tune_enet
#> # A tibble: 1 × 7
#>       lambda estimate std.error conf.low conf.high nzero alpha
#>        <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <int> <dbl>
#> 1 0.00000550   0.0281   0.00113   0.0270    0.0293     3     1

3.10.1 Evaluate the Penalized Logistic Regression with Optimal Parameters

Enet_Model = cv.glmnet(x = as.matrix(data[c("Red","Green","Blue")]), y = data$bluetarp, family="binomial", alpha = tune_enet$alpha,
foldid = folds, # use same folds for all alpha
type.measure = "deviance")

Enet.predict = predict(Enet_Model,newx = as.matrix(data[c("Red","Green","Blue")]), type="response")
#-- Calculate for range of thresholds
thres = seq(0.01, 1, length=20) # set of thresholds
perf_thres = map_df(thres, ~get_conf(., y=data$bluetarp, p_hat = Enet.predict)) %>%
mutate(cost = FN*1 + FP*1)
perf_thres
#> # A tibble: 20 × 6
#>    threshold    TP    FP    FN    TN  cost
#>        <dbl> <int> <int> <int> <int> <dbl>
#>  1    0.01    2015  3857     7 57362  3864
#>  2    0.0621  1973   659    49 60560   708
#>  3    0.114   1932   481    90 60738   571
#>  4    0.166   1899   435   123 60784   558
#>  5    0.218   1866   387   156 60832   543
#>  6    0.271   1823    83   199 61136   282
#>  7    0.323   1809    76   213 61143   289
#>  8    0.375   1789    65   233 61154   298
#>  9    0.427   1776    59   246 61160   305
#> 10    0.479   1767    47   255 61172   302
#> 11    0.531   1754    42   268 61177   310
#> 12    0.583   1739    41   283 61178   324
#> 13    0.635   1729    36   293 61183   329
#> 14    0.687   1714    30   308 61189   338
#> 15    0.739   1693    25   329 61194   354
#> 16    0.792   1675    23   347 61196   370
#> 17    0.844   1640    19   382 61200   401
#> 18    0.896   1595    14   427 61205   441
#> 19    0.948   1524     6   498 61213   504
#> 20    1          0     0  2022 61219  2022
#-- Make cost curve plot
perf_thres %>%
ggplot(aes(threshold, cost)) +
geom_line() +
geom_point() + geom_point(data = . %>% slice_min(cost), color="red", size=4) +
scale_x_continuous(breaks = seq(0, 1, by=.02))

Enet.optThreshold = 0.27052632
#evaluation
yhat = ifelse(Enet.predict >= Enet.optThreshold, 1L, 0L)
Enet.cm <- table(yhat, data$bluetarp)
print(Enet.cm)
#>     
#> yhat     0     1
#>    0 61136   199
#>    1    83  1823
FP = Enet.cm[1,2]
FN = Enet.cm[2,1]
TP = Enet.cm[2,2]
TN = Enet.cm[1,1]
#get the accuracy precision FPR and FPR
Enet.acc = (TP + TN)/(FP + FN + TP + TN)
Enet.pre =  TP/(TP + FP)
Enet.tpr = TP/(TP+FN)
Enet.fpr = FP/(FP + TN)
Enet.auc = auc(as.numeric(data$bluetarp), as.numeric(yhat))

3.11 Logistic Regression

3.11.1 Training the Model

LR_model = cv.glmnet(x = as.matrix(data[c("Red","Green","Blue")]), y = data$bluetarp, family="binomial", alpha = 0,
foldid = folds, # use same folds for all alpha
type.measure = "deviance")

3.11.2 Finding the Threshold

LR.predict = predict(LR_model,newx = as.matrix(data[c("Red","Green","Blue")]), type="response")
#-- Calculate for range of thresholds
thres = seq(0.01, 1, length=20) # set of thresholds
perf_thres = map_df(thres, ~get_conf(., y=data$bluetarp, p_hat = LR.predict)) %>%
mutate(cost = FN*1 + FP*1)
perf_thres
#> # A tibble: 20 × 6
#>    threshold    TP    FP    FN    TN  cost
#>        <dbl> <int> <int> <int> <int> <dbl>
#>  1    0.01    2022 52350     0  8869 52350
#>  2    0.0621  1769  2329   253 58890  2582
#>  3    0.114   1606   779   416 60440  1195
#>  4    0.166   1450   517   572 60702  1089
#>  5    0.218   1288   359   734 60860  1093
#>  6    0.271   1129     0   893 61219   893
#>  7    0.323   1023     0   999 61219   999
#>  8    0.375    935     0  1087 61219  1087
#>  9    0.427    834     0  1188 61219  1188
#> 10    0.479    691     0  1331 61219  1331
#> 11    0.531    505     0  1517 61219  1517
#> 12    0.583    336     0  1686 61219  1686
#> 13    0.635    204     0  1818 61219  1818
#> 14    0.687    104     0  1918 61219  1918
#> 15    0.739     50     0  1972 61219  1972
#> 16    0.792      4     0  2018 61219  2018
#> 17    0.844      0     0  2022 61219  2022
#> 18    0.896      0     0  2022 61219  2022
#> 19    0.948      0     0  2022 61219  2022
#> 20    1          0     0  2022 61219  2022
#-- Make cost curve plot
perf_thres %>%
ggplot(aes(threshold, cost)) +
geom_line() +
geom_point() + geom_point(data = . %>% slice_min(cost), color="red", size=4) +
scale_x_continuous(breaks = seq(0, 1, by=.02))

LR.optThreshold = 0.27052632

3.11.3 Evaluate the Logistic Regression Model

yhat = ifelse(LR.predict >= LR.optThreshold, 1L, 0L)
lr.cm <- table(yhat, data$bluetarp)
print(lr.cm)
#>     
#> yhat     0     1
#>    0 61219   893
#>    1     0  1129
FP = lr.cm[1,2]
FN = lr.cm[2,1]
TP = lr.cm[2,2]
TN = lr.cm[1,1]
#get the accuracy precision FPR and FPR
lr.acc = (TP + TN)/(FP + FN + TP + TN)
lr.pre =  TP/(TP + FP)
lr.tpr = TP/(TP+FN)
lr.fpr = FP/(FP + TN)
lr.auc = auc(as.numeric(data$bluetarp), as.numeric(yhat))

Threshold could be discussed here or within each model section above.

4 Results (Cross-Validation)

4.1 Performance Table

Model Tuning AUROC Threshold Accuracy TPR FPR Precision
Random Forest Ntree = 40,mtry = 3 0.97 NULL 0.9966 0.9623 0.0017 0.9433
Boosted Tree Ntree = 4, depth = 10 0.954 0.1 0.994 0.917 0.0029 0.9099
SVM cost = 400, gamma = 2, kernal = radial 0.972 0.1 0.997 0.9625 0.0015 0.9530
Naive Bayes Nonparametric Bandwidth = 2 0.718 NULL 0.977 0.691 0.016 0.5103858
LDA NULL 0.896 NULL(0.5) 0.983 0.725 0.006 0.801
QDA NULL 0.896 NULL(0.5) 0.995 0.990 0.005 0.840
Penalized Log Reg Lambd =0.0000055, alpha = 1 0.95 0.27052632 0.995 0.956 0.003 0.901
Logistic Regression Lambd =0.00416 0.779 0.2705 0.986 1 0.014 0.558
KNN k = 5 0.985 NULL 0.997 0.966 0.00098 0.970

4.2 ROC Curves

 #: get predictions
train_prediction.RF =  predict(RF_Model,data = data[c("Red","Green","Blue")])
pred.BT = predict(Boost_Model,data, type="response")
train_prediction.BT = ifelse(pred.BT >= Boost_opt_threshold, 1L, 0L)

pred[ind_eval] = predict(fit, data[ind_eval,], probability = TRUE) %>%
attr("probabilities") %>% .[,"1"]
#> Error in predict.glmnet(object$glmnet.fit, newx, s = lambda, ...): The number of variables in newx must be 3
yhat = ifelse(pred > .50, 1, 0) # hard classification
train_prediction.SVM = yhat


train_prediction.NB = predict(Naive_Bayes_Model,newdata= data[c("Red","Green","Blue")])

LDA_Prediction = predict(LDA_Model,newdata= data[c("Red","Green","Blue")])
train_prediction.LDA <- LDA_Prediction$class

QDA_Prediction  = predict(QDA_Model,newdata= data[c("Red","Green","Blue")])
train_prediction.QDA<- QDA_Prediction$class

Enet.predict <-predict(Enet_Model,newx = as.matrix(data[c("Red","Green","Blue")]), type="response")
train_prediction.Enet = ifelse(Enet.predict >= Enet.optThreshold, 1L, 0L)

LR.predict = predict(LR_model,newx = as.matrix(data[c("Red","Green","Blue")]), type="response")
train_prediction.LR= ifelse(LR.predict >= LR.optThreshold, 1L, 0L)
pred_data = 
  tibble(
    enet = as.numeric(unlist(Enet.predict)),
    logistic = as.numeric(unlist(LR.predict)),
    linked = data$bluetarp %>% factor(levels=c(1,0)),
    knn = as.numeric(unlist(KNN_Model)),
    svm = as.numeric(pred),
    random_forest =  as.numeric(unlist(train_prediction.RF)),
    boosted_tree = as.numeric(unlist(pred.BT)),
    Naive_Bayes = as.numeric(unlist(train_prediction.NB)),
    LDA = as.numeric(unlist(LDA_Prediction$posterior[,2])),
    QDA = as.numeric(unlist(QDA_Prediction$posterior[,2]))
  ) %>%  # note: response must be a factor
  # convert to long format 
  gather(model, prediction, enet, logistic,knn,svm, random_forest,boosted_tree,Naive_Bayes,LDA,QDA) %>% 
  # get performance for each model
  group_by(model) %>% 
  roc_curve(truth = linked, prediction)          # get ROC curve data
  
#: plot the ROC curve
pred_data %>% 
  ggplot(aes(1-specificity, sensitivity, color=model)) + 
  geom_path() + 
  geom_smooth(values=c(enet='black', logistic='brown4', knn ='cyan4',svm ='khaki3',random_forest = '#8B7B8B', boosted_tree = 'orange', Naive_BAYES = 'purple', LDA = 'palegreen2', QDA = '#7FFFD4'),size=3, span=0.8) +
  scale_x_continuous(breaks=seq(0, 1, by=.2)) + 
  scale_y_continuous(breaks=seq(0, 1, by=.2)) + labs(title ="ROC Curves")

5 Conclusions

In general, the performance of the four models is similar. As for the data I used, the proportion of positive and negative examples has reached over 1:10. Therefore, it is no longer meaningful to only look at the accuracy performance (if you keep taking samples and immediately classify them as not blue trap, you can get more than 90% accuracy classifier). Once we should pay more attention to TPR, FPR and percision. For humanitarian aid, our goal is to help as many victims as possible. When we have sufficient supplies and human resource, our philosophy is to find and treat all the victims. At this point, we need to reduce the case of False Negative generated by model prediction. This requires that the TPR of the model be high enough. At this time, it seems that the KNN model I trained is more satisfying. Its TPR is 96. Unfortunately, we didn’t have enough resources at the time when the disaster happened. Therefore, we need to help the homeless as efficiently as possible. We should avoid waste of medical resources and manpower as much as possible. At this time, we need to choose a model with a lower FPR. At this point, logistic Regression seems to be the more appropriate model. It has the smallest FPR, although it also has the smallest TPR (which means we’ll miss a lot of Blue TARP. But it avoids wasting resources. In general, KNN should be a perfect model among the models I have trained except for the two extreme cases mentioned above.

5.1 Conclusion #1

My model was trained from the haitiTrain.csv dataset. The performance of the model in this data set is far beyond my imagination. Percision is close to or above 90%. So they certainly help us quickly and accurately locate the blue TARP and deduce the location of the victims. But I don’t think this result is so superior in the real world. The first is that even if we were able to locate the Blue Tarp accurately, there would not always be homeless people living under the Blue TARP. Locating people is only the first step in helping them.

5.2 Conclusion #2

I think this data fits well with the model we chose in part1. First of all, the characteristic dimension of the sample is not high and the correlation between the three dimensions and Class is very high. Each dimension in RGB is independent of each other. Therefore, we do not need to do too much pre-processing before training the model. Secondly, these models are not very suitable for multi-classification problems, especially for Logistic regression. If we need to use these models for multiple classifications, we may need to introduce SoftMax layer or use multiple classifiers to achieve this goal, which will undoubtedly reduce accuracy. But for this problem, we only need to find the Blue TARP which means we only need to deal with binary classification. These relatively simple classifiers can do the job on their own. And the third is that color blue is special enough in all the samples. As I showed in the EDA section, their location in RGB 3d space is very concentrated and linearly separable from other colors. Therefore, no matter we use KNN(which only pay attention on their neighbors), logistic Regression model which can only deal with linear problems, or SVM classifier which Because can be used in more classification scenarios thanks to the kernal function, we can all get good results. For these data, we can even visually find a plane to separate the two types of data. So it’s definitely not too difficult for a classifier.

5.3 Conclusion #3

In the confuse matrix predicted by several of my models, more errors occur in False Negitave. This means that many of the blue sheds photographed in RBG maps have deviated from their original color. This may be due to other factors. Like the angle of the sunshine, or the weather. Maybe we can add new dimensions to the feature of data, like the weather at the time of the photo, the time of day. This might improve the performance of the model.

ADDITIONAL SECTIONS FOR PART II:

6 Hold-out Data / EDA

7 Load Hold-out Data

Load hold-out data, explore data, etc.

ortho057_noblue = read_table("orthovnir057_ROI_NON_Blue_Tarps.txt",skip = 8, col_names = FALSE) %>% dplyr::select(Red = X8,Green = X9,Blue = X10) %>% mutate(bluetarp = 0, file ="057")

ortho067_blue = read_table("orthovnir067_ROI_Blue_Tarps.txt",skip = 8, col_names = FALSE) %>% dplyr::select(Red = X8,Green = X9,Blue = X10) %>% mutate(bluetarp = 1, file ="067")

ortho067_noblue = read_table("orthovnir067_ROI_NOT_Blue_Tarps.txt",skip = 8, col_names = FALSE) %>% dplyr::select(Red = X8,Green = X9,Blue = X10) %>% mutate(bluetarp = 0, file ="067")


ortho069_noblue = read_table("orthovnir069_ROI_NOT_Blue_Tarps.txt",skip = 8, col_names = FALSE) %>% dplyr::select(Red = X8,Green = X9,Blue = X10) %>% mutate(bluetarp = 0, file ="069")

ortho069_blue = read_table("orthovnir069_ROI_Blue_Tarps.txt",skip = 8, col_names = FALSE) %>% dplyr::select(Red = X8,Green = X9,Blue = X10) %>% mutate(bluetarp = 1, file ="069")
hold_out_data = {}
hold_out_data = rbind(ortho057_noblue,ortho067_blue)
hold_out_data = rbind(hold_out_data,ortho067_noblue)
hold_out_data = rbind(hold_out_data,ortho069_blue)

hold_out_data = hold_out_data[sample(1:nrow(hold_out_data)), ]

7.1 EDA of Hold-out Data

head(hold_out_data,n=50)
#> # A tibble: 50 × 5
#>      Red Green  Blue bluetarp file 
#>    <dbl> <dbl> <dbl>    <dbl> <chr>
#>  1   255   197   132        0 057  
#>  2   117    96    69        0 057  
#>  3   232   202   157        0 057  
#>  4   109    88    65        0 057  
#>  5   101    87    59        0 057  
#>  6   121   102    69        0 057  
#>  7   184   161   122        0 067  
#>  8   134    99    69        0 057  
#>  9    98    94    59        0 057  
#> 10   103    80    56        0 057  
#> # … with 40 more rows
#-- The ratio of bluetarp and non-bluetarp
ggplot(hold_out_data) + 
  geom_bar(mapping = aes(x = bluetarp))

plotData <- hold_out_data[1:65000,]
plot_ly(x=plotData$Red, y=plotData$Green, z=plotData$Blue, type="scatter3d", mode="markers", color=plotData$bluetarp) %>% layout(autosize = F, width = 500, height = 500)

7.2 Compare with Training Data

plot_ly(x=data$Red, y=data$Green, z=data$Blue, type="scatter3d", mode="markers", color=data$bluetarp) %>% layout(autosize = F, width = 500, height = 500)

It seems that the training data and hold out data has the same distribution. So we can expect that trained model still work # Results (Hold-Out)

7.3 KNN

For KNN we need to train a new model with the Optimal K.

#knn

hold_out_model.knn <- FNN::knn(train = scale(hold_out_data[c("Red","Green","Blue")]), 
                test  = scale(hold_out_data[c("Red","Green","Blue")]),
                cl    = hold_out_data$bluetarp,
                k     = 7,
                prob  = TRUE)
hold_out_knn.cm = table(hold_out_model.knn, hold_out_data$bluetarp)
print(hold_out_knn.cm)
#>                   
#> hold_out_model.knn       0       1
#>                  0 1284305     224
#>                  1     184   11050
plot(hold_out_knn.cm)

CM = hold_out_knn.cm
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_knn.acc = (TP + TN)/(FP + FN + TP + TN)
hold_knn.pre =  TP/(TP + FP)
hold_knn.tpr = TP/(TP+FN)
hold_knn.fpr = FP/(FP + TN)
hold_knn.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_out_model.knn))

7.4 LR

#
Hold_LR.predict = predict(LR_model,newx = as.matrix(hold_out_data[c("Red","Green","Blue")]), type="response")
hold_train_prediction.LR= ifelse(Hold_LR.predict >= LR.optThreshold, 1L, 0L)
hold_out_lr.cm = table(hold_train_prediction.LR, hold_out_data$bluetarp)
print(hold_out_lr.cm)
#>                         
#> hold_train_prediction.LR       0       1
#>                        0 1284470    5827
#>                        1      19    5447
plot(hold_out_lr.cm)

CM = hold_out_lr.cm
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_lr.acc = (TP + TN)/(FP + FN + TP + TN)
hold_lr.pre =  TP/(TP + FP)
hold_lr.tpr = TP/(TP+FN)
hold_lr.fpr = FP/(FP + TN)
hold_lr.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.LR))

7.5 Enet

#
Hold_Enet.predict <-predict(Enet_Model,newx = as.matrix(hold_out_data[c("Red","Green","Blue")]), type="response")
hold_train_prediction.Enet = ifelse(Hold_Enet.predict >= Enet.optThreshold, 1L, 0L)
hold_out_enet.cm = table(hold_train_prediction.Enet, hold_out_data$bluetarp)
print(hold_out_enet.cm)
#>                           
#> hold_train_prediction.Enet       0       1
#>                          0 1280203     113
#>                          1    4286   11161
CM = hold_out_enet.cm 
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_Enet.acc = (TP + TN)/(FP + FN + TP + TN)
hold_Enet.pre =  TP/(TP + FP)
hold_Enet.tpr = TP/(TP+FN)
hold_Enet.fpr = FP/(FP + TN)
hold_Enet.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.Enet))

7.6 SVM

#
Hold_pred = predict(fit, hold_out_data, probability = TRUE) %>%
attr("probabilities") %>% .[,"1"]
#> Error in predict.glmnet(object$glmnet.fit, newx, s = lambda, ...): The number of variables in newx must be 3
hold_train_prediction.SVM = ifelse(Hold_pred > .50, 1, 0) # hard classification
#> Error in ifelse(Hold_pred > 0.5, 1, 0): 找不到对象'Hold_pred'
hold_out_svm.cm = table(hold_train_prediction.SVM, hold_out_data$bluetarp)
#> Error in table(hold_train_prediction.SVM, hold_out_data$bluetarp): 找不到对象'hold_train_prediction.SVM'
print(hold_out_svm.cm)
#> Error in h(simpleError(msg, call)): 在为'print'函数选择方法时评估'x'参数出了错: 找不到对象'hold_out_svm.cm'
plot(hold_out_svm.cm)
#> Error in plot(hold_out_svm.cm): 找不到对象'hold_out_svm.cm'
CM = hold_out_svm.cm
#> Error in eval(expr, envir, enclos): 找不到对象'hold_out_svm.cm'
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_svm.acc = (TP + TN)/(FP + FN + TP + TN)
hold_svm.pre =  TP/(TP + FP)
hold_svm.tpr = TP/(TP+FN)
hold_svm.fpr = FP/(FP + TN)
hold_svm.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.SVM))
#> Error in roc.default(response, predictor, auc = TRUE, ...): 找不到对象'hold_train_prediction.SVM'

7.7 Naive Bayes

hold_train_prediction.NB = predict(Naive_Bayes_Model,newdata= hold_out_data[c("Red","Green","Blue")])
hold_out_nb.cm = table(hold_train_prediction.NB, hold_out_data$bluetarp)
print(hold_out_nb.cm)
#>                         
#> hold_train_prediction.NB       0       1
#>                        0 1283856    8446
#>                        1     633    2828
plot(hold_out_nb.cm)

CM = hold_out_nb.cm
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_nb.acc = (TP + TN)/(FP + FN + TP + TN)
hold_nb.pre =  TP/(TP + FP)
hold_nb.tpr = TP/(TP+FN)
hold_nb.fpr = FP/(FP + TN)
hold_nb.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.NB))

7.8 LDA

hold_LDA_Prediction = predict(LDA_Model,newdata= hold_out_data[c("Red","Green","Blue")])
hold_train_prediction.LDA <- hold_LDA_Prediction$class
hold_out_lda.cm = table(hold_train_prediction.LDA, hold_out_data$bluetarp)
print(hold_out_lda.cm)
#>                          
#> hold_train_prediction.LDA       0       1
#>                         0 1280100    2093
#>                         1    4389    9181
plot(hold_out_lda.cm)

CM = hold_out_lda.cm
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_lda.acc = (TP + TN)/(FP + FN + TP + TN)
hold_lda.pre =  TP/(TP + FP)
hold_lda.tpr = TP/(TP+FN)
hold_lda.fpr = FP/(FP + TN)
hold_lda.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.LDA))

7.9 QDA

hold_QDA_Prediction  = predict(QDA_Model,newdata= hold_out_data[c("Red","Green","Blue")])
hold_train_prediction.QDA<- hold_QDA_Prediction$class
hold_out_qda.cm = table(hold_train_prediction.QDA, hold_out_data$bluetarp)
print(hold_out_qda.cm)
#>                          
#> hold_train_prediction.QDA       0       1
#>                         0 1284099    3202
#>                         1     390    8072
plot(hold_out_qda.cm)

CM = hold_out_qda.cm
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_qda.acc = (TP + TN)/(FP + FN + TP + TN)
hold_qda.pre =  TP/(TP + FP)
hold_qda.tpr = TP/(TP+FN)
hold_qda.fpr = FP/(FP + TN)
hold_qda.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.QDA))


#

7.10 BoostTree

hold_pred.BT = predict(Boost_Model,hold_out_data, type="response")
hold_train_prediction.BT = ifelse(hold_pred.BT >= Boost_opt_threshold, 1L, 0L)
hold_out_bt.cm = table(hold_train_prediction.BT, hold_out_data$bluetarp)
print(hold_out_bt.cm)
#>                         
#> hold_train_prediction.BT       0       1
#>                        0 1278995     751
#>                        1    5494   10523
plot(hold_out_bt.cm)

CM = hold_out_bt.cm
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_bt.acc = (TP + TN)/(FP + FN + TP + TN)
hold_bt.pre =  TP/(TP + FP)
hold_bt.tpr = TP/(TP+FN)
hold_bt.fpr = FP/(FP + TN)
hold_bt.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.BT))
#

7.11 RandomForest

#
hold_train_prediction.RF =  predict(RF_Model,newdata = hold_out_data[c("Red","Green","Blue")])
hold_out_rf.cm = table(hold_train_prediction.RF, hold_out_data$bluetarp)
print(hold_out_rf.cm)
#>                         
#> hold_train_prediction.RF       0       1
#>                        0 1282220    1880
#>                        1    2269    9394
plot(hold_out_rf.cm)

CM = hold_out_rf.cm
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_rf.acc = (TP + TN)/(FP + FN + TP + TN)
hold_rf.pre =  TP/(TP + FP)
hold_rf.tpr = TP/(TP+FN)
hold_rf.fpr = FP/(FP + TN)
hold_rf.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.RF))

7.12 Hold-out Performance Table

Model AUROC Accuracy TPR FPR Precision
Random Forest 0.887 0.996 0.786 0.0019 0.774
Boosted Tree 0.961 0.995 0.680 0.0006 0.925
SVM 0.785 0.993 0.638 0.0037 0.571
Naive Bayes 0.625 0.993 0.817 0.0065 0.253
LDA 0.905 0.995 0.6765 0.0016 0.814
QDA 0.858 0.997 0.9534 0.0024 0.715
Penalized Log Reg 0.993 0.996 0.722 0.000008 0.9899
Logistic Regression 0.742 0.995 0.996 0.004 0.483
KNN 0.99 0.999 0.983 0.00017 0.980

8 Final Conclusions

Conclusions for part2: ## Conclusion #1 It seems that the SVM or KNN model performs the best in CV performance. SVM has more degrees of freedom in the training process and the characteristics of SVM make it less prone to overfitting. In hold-out data, because bluetrap accounts for a smaller proportion of all data, accuracy becomes less important. The overall performance of KNN is better in this case.It’s AUROC,percision are both also to 1.

8.1 Conclusion #2

In the test of hold-out data. The FPR of KNN is only 0.0017, which means we have very little discipline to misjudge a location as a bluetarp. This is very important for humanitarian relief. Because our supplies are often insufficient during rescue. Locating refugee locations more efficiently and accurately means we can deploy supplies more efficiently without being disturbed by false results. Apart from that, other parameters of this model also performed well. Compared to Penalized Log Reg. He has a higher TPR. This means that it is also not possible for us to miss a lot of refugees.

8.2 Conclusion #3

The choice of model also depends on the specific situation. For example, when we have sufficient supplies and rescue forces, we want to avoid the FN situation. At this time Penalized Logistic Regression is still the best choice. When our supplies and rescue forces are insufficient, we should try to avoid the FP situation. At this time Logistic Regression is a very extreme choice. Although its overall performance is poor, it only classified 19 points that are not bluetarp as bluetarp.

8.3 Conclusion #4

Since the data is uneven accuracy is arguably the most useless data in the table. In contrast, AUROC can better represent the performance of the model. It basically aggregates the performance of the model at all threshold values. These metrics are also mutually exclusive in certain cases. For example, when the precision of our model is very high, it means that FP is a very small number. However for FPR(is equal to FP/(FP +FN). When we have a very small FP FPR will have a very large number less chance. TPR and percision also have a certain correlation. In some extreme cases (FN = TN) and the denominators of their expressions are both TP. At this point the values of the two metrics will be very close.

8.4 Conclusion #5

In this project, we can draw a conclusion that sometimes accuracy is not very meaningful. Over 90% of the data we use is not bluetarp. Under such data, even if we predict all the data as not bluetarp, our model accuracy can still be higher than 95%. But that completely defeats the reason why we are building a model to make predictions. We should define the loss function at this time to reevaluate the performance of our model. In addition, we can also customize the weight of the loss function to make the model more aggressive in a certain direction. ### Conclusion #6 The best threshold depends on the objective of the model. The threshold is not a fixed number (I used to think it could always be set to 0.5). Through this project I think we should change the threshold size according to our needs. After defining the loss function, we can find the most suitable threshold for our application scenario by traversing all the possible thresholds.

8.5 Conclusion #7

Data mining is definitely not an easy task. At first I thought that finding bluetarp using RGB data would be a particularly easy task. Because blue is less likely to appear on land, the correlation between blue and bluetrap should be very high. Then the RGB data we used corresponds exactly to blue. It can be said that it will not be a difficult thing to build a model and classify with such a high correlation feature. But in the project we still need a lot of parameters in various models. These parameters also have a large impact on the model. Whether the data is balanced, the loss function we define, and what parameters are used to express the performance of the model. These are all factors we need to consider. Not to mention that in other problems we also need to do a lot of preprocessing on the data. For example, we need to drop some features that are irrelevant to prediction, or delete some repetitive features. So data mining is more than just feeding data into a function and getting a model.